perm filename VOTYPE[8,ALS] blob
sn#041488 filedate 1973-05-09 generic text, type T, neo UTF8
00010 A.L.Samuel May 8, 1973
00020
00030 The Case for the Voice Typewriter
00040
00050 While the present trend in the computer handling of
00060 speech is largely directed toward Speech Understanding rather
00070 than Speech Transcription, a case can be made for the proposition
00080 that the understanding of speech in the strict sense is a very
00090 much harder task than is the transcription of speech from its
00100 acoustic form to the conventional representation of this speech
00110 in ordinary written form. If speech understanding is harder than speech
00115 transcription then it seems that it might be well to reconsider the
00117 entire direction of current work. Without exception, all of the ARPA
00120 Speech Recognition groups are restricting themselves to very
00130 limited domains of discourse, so limited in fact that it is
00140 extremely doubtful that their final systems will be of any use
00150 commercially, at least in the foreseeable future. The same objections
00160 can be made to the work of several industrial organizations,
00170 altho in those cases where I have any detailed information it seems
00180 to me that a more realistic approach is being followed and
00190 grandiose claims are not being made. A physically realizable
00195 speech transcription system might not be so severely limited in its
00197 scope and might be a more practical system for current exploitation.
00200
00210 It is my belief that any industrial organization that
00220 wants to cash in on Speech Recognition should have two parallel
00230 programs, one along the conventional speech understanding approach
00240 to guarantee access to information that does come out of the
00250 ARPA work, and a second on speech transcription. The first of
00260 these programs should be conducted with a minimum of commercial
00270 secrecy, just enough to keep the outside world guessing as to
00280 what is being done but not so much as to erect barriers between
00290 the organization's workers and the outside or to lead others to
00300 suspect that there is anything else going on. The second program
00310 could then be run in complete secrecy so that the organization
00320 might reasonably lead the pack in the first commercial exploitation
00330 of a speech transcription system.
00340
00350 Let me outline my present conception of what a commercial
00360 Speech Transcription system might be like. It is, of course, quite
00370 unrealistic to expect that one can construct a complete set of
00380 specifications for such a system at this time since there are still
00390 a great many unknowns with respect to the specific problems that
00400 will be encountered and the compromises that will have to be made,
00410 but here goes. In the first place, the first system to be introduced
00420 will undoubtedly require a rather large computer altho at the present
00430 rate of miniaturization and cost reduction this requirement can be
00440 expected to ease considerably. Assuming a rather expensive computer
00450 one would expect that the first users would be the larger companies
00460 that now have rather large typing pools. The system should therefore
00470 be one that could be phased in under these conditions and could make
00480 use of the displaced typists in some capacity so as to increase the
00490 total typing service capabilities of the overall operation. The
00500 better typists could be promoted to private secretarial jobs, the
00510 next level personnel would be retained to provide a post editing
00520 function and some saving in costs could be achieved by dispensing
00530 with the rest of the staff either by normal attrition or otherwise.
00540
00550 Users of the system would dictate their letters to the
00560 system via telephone lines. No attempt to operate in real time would
00570 be made. Instead, a limited amount of processing only would be done
00580 at the time of generation, only enough to get the input into digital
00590 form with perhaps some condensation to save on storage space. There
00600 would be a personalized data bank for each regular user of the system
00610 which would be used in the subsequent processing. Incoming letters
00620 would be stacked in queues with any desired priority provisions and
00630 then processed from these queues. The processed letters could then
00640 be either typed out directly and delivered to the originator or they
00650 could be displayed to the originator on a scope for editing and
00660 correction. Users entitled to the service could have their letters delivered
00670 to a secretary or a member of the typing pool for preliminary editing
00680 and correction before the letters would go to the originator. Again,
00690 this could be in the form of hard copy or of a scope display. The
00700 overall system would have to include a rather good editing program
00710 so that the correction and editing of the letter could be done with
00720 a minimum of effort on the part of the corrector.
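
The queueing arrangement described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration only: the names `QueuedLetter` and `DictationQueue`, the convention that a lower number means higher priority, and the use of a plain dictionary to stand in for the per-user data bank. It shows dictated letters being accepted with little processing and then drawn from the queue later in priority order, with ties settled by order of arrival.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedLetter:
    # Only priority and sequence take part in ordering.
    priority: int                      # assumed convention: lower = sooner
    sequence: int                      # arrival order, breaks priority ties
    user: str = field(compare=False)
    digitized_speech: bytes = field(compare=False)


class DictationQueue:
    """Hypothetical holding queue for dictated letters awaiting processing."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()
        # Stand-in for the personalized data bank kept for each regular user.
        self.user_data = {}

    def submit(self, user, digitized_speech, priority=5):
        # At dictation time the input is only stored, not transcribed.
        letter = QueuedLetter(priority, next(self._counter), user, digitized_speech)
        heapq.heappush(self._heap, letter)
        self.user_data.setdefault(user, {})
        return letter

    def next_letter(self):
        # Later, off line, letters are taken in priority order.
        return heapq.heappop(self._heap) if self._heap else None


queue = DictationQueue()
queue.submit("jones", b"...", priority=5)
queue.submit("smith", b"...", priority=1)
queue.submit("brown", b"...", priority=5)
order = [queue.next_letter().user for _ in range(3)]
```

With the three submissions shown, `order` comes out as smith, jones, brown: the urgent letter first, then the two routine letters in the order they were dictated.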
00730
00740 It might be well to discuss the minimum requirements as to
00750 accuracy that such a system would have to meet in order to be usable.
00760 The basic requirement would be that the initial output would have to be
00770 completely understandable. A second requirement would certainly be
00780 that the task of correcting the errors must not be more difficult
00790 and time consuming than the task of transcribing the letter by
00800 conventional methods. Perhaps a fairer comparison would be with the
00810 existing magnetic typewriters which provide correction facilities.
00820 Finally one would hope that there would be a sufficient margin between
00830 costs and the quality of output obtainable to make it possible to
00840 develop an adequate market.
00850
00860 The next question is not so much whether it will be possible
00870 to develop such a transcription system, because there is little doubt
00880 but that it can and will be done, but rather, how long it will take and
00890 what it will cost. No very precise answer can be given to these
00900 questions, except to say that the time will depend on the level of
00910 effort and the cost can be controlled by holding the level of effort
00920 to the minimum amount consistent with the objective of getting there
00930 first. The only practical answer would be to start now with a modest
00940 effort and to speed up the effort or slow it down as dictated by
00950 the rate of progress and the clues one might get as to the
00960 level of outside competition.
00970
00980 Some general remarks may be in order as to how one might
00990 proceed to develop a voice typewriter and as to how this effort
01000 would depart from work on Speech Understanding. In the first place,
01010 the entire approach should be based upon the initial identification
01020 of objects smaller than words or phrases, since the number of
01030 different words that would have to be stored and referenced for a
01040 strictly word recognition system to work would be entirely too large.
01050 Word recognition would be used of course to perform the final
01060 transcription from some phonetic representation into the final written
01070 form since the spelling of English words is illogical and inconsistent,
01080 but it should still be possible to render a reasonable transcription of
01090 words that do not appear in the available dictionary. There is some
01100 question as to whether the initial identification should be in terms
01110 of syllables, phonemes or even smaller units. The identification of
01120 these units would be an individualized matter since speakers differ
01130 in their rendition of phonetic events. The Signature Table approach
01140 as currently under study in the A.I. project at Stanford University
01150 seems to offer a very convenient method of acquiring the individualized
01160 data that is needed. Once this initial identification has been made the
01170 subsequent processing is reasonably independent of speaker idiosyncrasies.
01180 From here on the work that is currently being done by the various ARPA
01190 groups can be called upon for ideas as to how to proceed. However, in
01290 the case of speech transcription use would be made of phonological
01390 rules, dynamic pattern matching, linguistic constraints and meaning
01490 to resolve ambiguities in the classification of the basic units,
01590 and not to establish meaning per se. By way of contrast, the emphasis
01690 of the speech understanding work is on the establishment of meaning
01790 without any serious attempt to identify the written equivalent of what
01890 was actually said.
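
The final step described above, going from a phonetic representation to written form by word lookup while still rendering words absent from the dictionary, might be sketched as follows. This is only an illustration under stated assumptions: the phonetic symbols are ARPAbet-style labels picked for the example, and the tiny `DICTIONARY` and `FALLBACK_SPELLING` tables and the function names are hypothetical, not part of any existing system. The earlier stages (identifying the units and resolving ambiguities among them) are taken as already done.

```python
# Hypothetical pronouncing dictionary: sequence of phonetic units -> spelling.
DICTIONARY = {
    ("DH", "AH"): "the",
    ("K", "EY", "S"): "case",
    ("V", "OY", "S"): "voice",
}

# Hypothetical unit-by-unit spellings, used only when lookup fails.
FALLBACK_SPELLING = {
    "B": "b", "K": "k", "L": "l", "S": "s", "V": "v",
    "AH": "u", "EY": "ay", "IH": "i", "OY": "oy", "DH": "th",
}


def transcribe_word(units):
    """Return the dictionary spelling, or a rough phonetic spelling if absent."""
    key = tuple(units)
    if key in DICTIONARY:
        return DICTIONARY[key]
    # Word not in the dictionary: spell it out unit by unit so the
    # post editor still sees something readable; unknown units become "?".
    return "".join(FALLBACK_SPELLING.get(unit, "?") for unit in units)


def transcribe(words):
    """Transcribe a list of words, each given as a list of phonetic units."""
    return " ".join(transcribe_word(word) for word in words)


known = transcribe([["DH", "AH"], ["V", "OY", "S"]])
unknown = transcribe_word(["B", "L", "IH", "K"])
```

Here `known` is "the voice", taken from the dictionary, while `unknown` is the made-up word "blik", spelled out by the fallback table. A usable system would need far richer spelling rules; the point is only that lookup and fallback can coexist.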
01990
02090 Finally one must draw a clear distinction between the approach
02190 that should be taken to acquire the information that will be needed
02290 to design a speech transcription system and the actual design of the
02390 system. Clearly there is a real advantage in terms of flexibility
02490 in doing everything by programming during the study phase while a
02590 real gain in speed for the final operating system can be obtained
02690 by designing special purpose hardware for the execution of highly
02790 repetitive portions of the analysis. This distinction has all too
02890 often been obscured by workers in the field who start their study
02990 by making restrictive choices as to hardware before information is
03090 available to permit an intelligent choice.